{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "o7JCvRK9fM12"
   },
   "source": [
    "<center>\n",
    "    <p style=\"text-align:center\">\n",
    "        <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n",
    "        <br>\n",
    "        <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
    "        |\n",
    "        <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
    "        |\n",
    "        <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n",
    "    </p>\n",
    "</center>\n",
    "<h1 align=\"center\">Phoenix Python Tutorial - Tracing, Evaluation, and Experimentation</h1>\n",
    "\n",
    "In this tutorial, we’ll build a simple travel agent using the Agno framework and OpenAI models.\n",
    "1. We’ll start by installing the required OpenInference packages and setting up tracing with Arize Phoenix.\n",
    "\n",
    "2. Next, we’ll configure a dataset for the agent and upload it to Phoenix, define LLM-based evaluators using the Phoenix Evals Library, and run an experiment to assess the agent’s performance.\n",
    "\n",
    "3. We’ll then review the results, iterate on the agent, re-run the experiment, and observe the improvements.\n",
    "\n",
    "This end-to-end walkthrough shows how to use Phoenix to trace, evaluate, and systematically improve your applications and agents.\n",
    "\n",
    "### **You will need a free Arize Phoenix Cloud account, a free Tavily API key, and an OpenAI API key.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "h4ScTiU8rCOq"
   },
   "source": [
    "# Install Dependencies and Set Up Keys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -qqqqq arize-phoenix openai agno openinference-instrumentation-openai openinference-instrumentation-agno httpx"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = globals().get(\"PHOENIX_COLLECTOR_ENDPOINT\") or getpass(\n",
    "    \"🔑 Enter your Phoenix Endpoint: \"\n",
    ")\n",
    "\n",
    "os.environ[\"PHOENIX_API_KEY\"] = globals().get(\"PHOENIX_API_KEY\") or getpass(\n",
    "    \"🔑 Enter your Phoenix API Key: \"\n",
    ")\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = globals().get(\"OPENAI_API_KEY\") or getpass(\n",
    "    \"🔑 Enter your OpenAI API Key: \"\n",
    ")\n",
    "\n",
    "os.environ[\"TAVILY_API_KEY\"] = globals().get(\"TAVILY_API_KEY\") or getpass(\n",
    "    \"🔑 Enter your Tavily API Key: \"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4nVh1WfRrW-j"
   },
   "source": [
    "The `register` function from `phoenix.otel` sets up instrumentation so your agent will automatically sends traces to Phoenix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phoenix.otel import register\n",
    "\n",
    "tracer_provider = register(auto_instrument=True, project_name=\"python-phoenix-tutorial\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ONpHM8MJrlzy"
   },
   "source": [
    "Here, we will grab the `tracer` object to manually instrument some functions later on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from opentelemetry import trace\n",
    "\n",
    "tracer = trace.get_tracer(__name__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "J-6ndtIcAQeB"
   },
   "source": [
    "# Define Agent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kQd5Vu70rtMB"
   },
   "source": [
    "In this section, we’ll build our travel agent. Users will be able to describe their destination, travel dates, and interests, and the agent will generate a customized, budget-conscious itinerary."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "c48ggJlxEAIS"
   },
   "source": [
    "## Define Tools"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "eJBxKuCVrx5k"
   },
   "source": [
    "First, we’ll define a few helper functions to support our agent's tools. In particular, we’ll use **Tavily Search** to help the tools gather general information about each destination and **Open-Mateo** to get the weather."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- Helper functions for tools ---\n",
    "import httpx\n",
    "\n",
    "\n",
    "@tracer.chain(name=\"search-api\")\n",
    "def _search_api(query: str) -> str | None:\n",
    "    \"\"\"Try Tavily search first, fall back to None.\"\"\"\n",
    "    api_key = os.getenv(\"TAVILY_API_KEY\")\n",
    "\n",
    "    resp = httpx.post(\n",
    "        \"https://api.tavily.com/search\",\n",
    "        json={\n",
    "            \"api_key\": api_key,\n",
    "            \"query\": query,\n",
    "            \"max_results\": 3,\n",
    "            \"search_depth\": \"basic\",\n",
    "            \"include_answer\": True,\n",
    "        },\n",
    "        timeout=8,\n",
    "    )\n",
    "    data = resp.json()\n",
    "    answer = data.get(\"answer\", \"\") or \"\"\n",
    "    snippets = [item.get(\"content\", \"\") for item in data.get(\"results\", [])]\n",
    "\n",
    "    combined = \" \".join([answer] + snippets).strip()\n",
    "    return combined[:400] if combined else None\n",
    "\n",
    "\n",
    "@tracer.chain(name=\"weather-api\")\n",
    "def _weather(dest):\n",
    "    g = httpx.get(f\"https://geocoding-api.open-meteo.com/v1/search?name={dest}\")\n",
    "    if g.status_code != 200 or not g.json().get(\"results\"):\n",
    "        return \"\"\n",
    "    lat, lon = g.json()[\"results\"][0][\"latitude\"], g.json()[\"results\"][0][\"longitude\"]\n",
    "    w = httpx.get(\n",
    "        f\"https://api.open-meteo.com/v1/forecast?latitude={lat}&longitude={lon}&current_weather=true\"\n",
    "    ).json()\n",
    "    cw = w.get(\"current_weather\", {})\n",
    "    return f\"Weather now: {cw.get('temperature')}°C, wind {cw.get('windspeed')} km/h.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "n_vrHFuXEIkQ"
   },
   "source": [
    "Our agent will have access to three tools:\n",
    "\n",
    "1. Essential Info – Provides key travel details about the destination, such as weather and general conditions.\n",
    "\n",
    "2. Budget Basics – Offers insights into travel costs and helps plan budgets based on selected activities.\n",
    "\n",
    "3. Local Flavor – Recommends unique local experiences and cultural highlights."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from agno.tools import tool\n",
    "\n",
    "\n",
    "@tool\n",
    "def essential_info(destination: str) -> str:\n",
    "    \"\"\"Get essential info using Search and Weather APIs\"\"\"\n",
    "    parts = []\n",
    "\n",
    "    q = f\"{destination} travel essentials weather best time top attractions etiquette\"\n",
    "    s = _search_api(q)\n",
    "    if s:\n",
    "        parts.append(f\"{destination} essentials: {s}\")\n",
    "    else:\n",
    "        parts.append(\n",
    "            f\"{destination} is a popular travel destination. Expect local culture, cuisine, and landmarks worth exploring.\"\n",
    "        )\n",
    "\n",
    "    weather = _weather(destination)\n",
    "    if weather:\n",
    "        parts.append(weather)\n",
    "\n",
    "    return f\"{destination} essentials:\\n\" + \"\\n\".join(parts)\n",
    "\n",
    "\n",
    "@tool\n",
    "def budget_basics(destination: str, duration: str) -> str:\n",
    "    \"\"\"Summarize travel cost categories.\"\"\"\n",
    "    q = f\"{destination} travel budget average daily costs {duration}\"\n",
    "    s = _search_api(q)\n",
    "    if s:\n",
    "        return f\"{destination} budget ({duration}): {s}\"\n",
    "    return f\"Budget for {duration} in {destination} depends on lodging, meals, transport, and attractions.\"\n",
    "\n",
    "\n",
    "@tool\n",
    "def local_flavor(destination: str, interests: str = \"local culture\") -> str:\n",
    "    \"\"\"Suggest authentic local experiences.\"\"\"\n",
    "    q = f\"{destination} authentic local experiences {interests}\"\n",
    "    s = _search_api(q)\n",
    "    if s:\n",
    "        return f\"{destination} {interests}: {s}\"\n",
    "    return f\"Explore {destination}'s unique {interests} through markets, neighborhoods, and local eateries.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EvTXGk2yE_x9"
   },
   "source": [
    "## Build the Agent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "imjauE8UsCsm"
   },
   "source": [
    "Next, we’ll construct our agent. The Agno framework makes this process straightforward by allowing us to easily define key parameters such as the model, instructions, and tools."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from agno.agent import Agent\n",
    "from agno.models.openai import OpenAIChat\n",
    "\n",
    "# --- Main Agent ---\n",
    "trip_agent = Agent(\n",
    "    name=\"TripPlanner\",\n",
    "    role=\"AI Travel Assistant\",\n",
    "    model=OpenAIChat(id=\"gpt-4o-mini\"),\n",
    "    instructions=(\n",
    "        \"You are a friendly and knowledgeable travel planner. \"\n",
    "        \"Combine multiple tools to create a trip plan including essentials, budget, and local flavor. \"\n",
    "        \"Keep the tone natural, clear, and under 1000 words.\"\n",
    "    ),\n",
    "    markdown=True,\n",
    "    tools=[essential_info, budget_basics, local_flavor],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- Example usage ---\n",
    "destination = \"Tokyo\"\n",
    "duration = \"5 days\"\n",
    "interests = \"food, culture\"\n",
    "\n",
    "query = f\"\"\"\n",
    "Plan a {duration} trip to {destination}.\n",
    "Focus on {interests}.\n",
    "Include essential info, budget breakdown, and local experiences.\n",
    "\"\"\"\n",
    "\n",
    "\n",
    "def agent_task(query):\n",
    "    trip_agent.print_response(query, stream=True)\n",
    "\n",
    "\n",
    "agent_task(query)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dxOJWC5SASD3"
   },
   "source": [
    "# Define Dataset to Run on the Agent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dEnXKGlPsXFq"
   },
   "source": [
    "In order to experiment with our agent, we first need to define a dataset for it to run on. This provides a standardized way to evaluate the agent’s behavior across consistent inputs.\n",
    "\n",
    "In this example, we’ll use a small dataset of ten examples and upload it to Phoenix using the Phoenix Client. Once uploaded, all experiments will be tracked alongside this dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phoenix.client import Client\n",
    "\n",
    "client = Client()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# --- Example queries ---\n",
    "queries = [\n",
    "    \"Plan a 7-day trip to Italy focused on art, history, and local food. Include essential travel info, a budget estimate, and key attractions in Rome, Florence, and Venice.\",\n",
    "    \"Create a 4-day itinerary for Seoul centered on K-pop, fashion districts, and street food. Include transportation tips and a mid-range budget.\",\n",
    "    \"Plan a romantic 5-day getaway to Paris with emphasis on museums, wine tasting, and scenic walks. Provide cost estimates and essential travel notes.\",\n",
    "    \"Design a 3-day budget trip to Mexico City focusing on food markets, archaeological sites, and nightlife. Include daily cost breakdowns.\",\n",
    "    \"Prepare a 6-day itinerary for New Zealand’s South Island with a focus on outdoor adventure, hikes, and photography spots. Include travel logistics and gear essentials.\",\n",
    "    \"Plan a 10-day trip across Spain, hitting Barcelona, Madrid, and Seville. Focus on architecture, tapas, and cultural festivals. Include a detailed budget.\",\n",
    "    \"Create a 5-day family-friendly itinerary for Singapore with theme parks, nature activities, and kid-friendly dining. Include entry fees and transit costs.\",\n",
    "    \"Plan a 4-day luxury spa and relaxation trip to Bali. Include premium resorts, wellness activities, and a high-end budget.\",\n",
    "    \"Design a 7-day solo backpacking trip through Thailand with hostels, street food, and cultural attractions. Provide safety essentials and budget breakdown.\",\n",
    "    \"Create a 3-day weekend itinerary for New York City focusing on art galleries, rooftop restaurants, and iconic attractions. Include estimated costs.\",\n",
    "]\n",
    "\n",
    "dataset_df = pd.DataFrame(data={\"input\": queries})\n",
    "\n",
    "dataset = client.datasets.create_dataset(\n",
    "    dataframe=dataset_df, name=\"travel-questions\", input_keys=[\"input\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "aAlhegXqs8fm"
   },
   "source": [
    "![Dataset Uploaded to Phoenix](https://storage.googleapis.com/arize-phoenix-assets/assets/images/end-to-end-python-tutorial-dataset.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SqtX1nZbATfb"
   },
   "source": [
    "# Define Evaluators"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7Jha-WB3tBQD"
   },
   "source": [
    "Next, we need a way to assess the agent’s outputs. This is where Evals come in. Evals provide a structured method for measuring whether an agent’s responses meet the requirements defined in a dataset—such as accuracy, relevance, consistency, or safety.\n",
    "\n",
    "In this tutorial, we will be using LLM-as-a-Judge Evals, which rely on another LLM acting as the evaluator. We define a prompt describing the exact criteria we want to evaluate, and then pass the input and the agent-generated output from each dataset example into that evaluator prompt. The evaluator LLM then returns a score along with a natural-language explanation justifying why the score was assigned.\n",
    "\n",
    "This allows us to automatically grade the agent’s performance across many examples, giving us quantitative metrics as well as qualitative insight into failure cases."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bXxNQSFLvrL2"
   },
   "source": [
    "![Phoenix Evals](https://storage.googleapis.com/arize-phoenix-assets/assets/images/phoenix_evals_diagram.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ANSWER_RELEVANCE_PROMPT_TEMPLATE = \"\"\"\n",
    "You will be given a travel-planning query and an itinerary answer. Your task is to decide whether\n",
    "the answer correctly follows the user's instructions. An answer is \"incorrect\" if it contradicts,\n",
    "ignores, or fails to include required elements from the query (such as trip length, destination,\n",
    "themes, budget details, or essential info). It is also \"incorrect\" if it adds irrelevant or\n",
    "contradictory details.\n",
    "\n",
    "    [BEGIN DATA]\n",
    "    ************\n",
    "    [Query]: {{input}}\n",
    "    ************\n",
    "    [Answer]: {{output}}\n",
    "    ************\n",
    "    [END DATA]\n",
    "\n",
    "Explain step-by-step how you determined your judgment. Then provide a final LABEL:\n",
    "- Use \"correct\" if the answer follows the query accurately and fully.\n",
    "- Use \"incorrect\" if it deviates from the query or omits required information.\n",
    "\n",
    "Your final output must be only one word: \"correct\" or \"incorrect\".\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "AyXIfJ1UthNc"
   },
   "source": [
    "After defining the evaluator prompt, we use the Phoenix Evals library to construct an evaluator instance. Notice that the evaluator LLM is separate from the model powering the agent itself."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phoenix.evals import create_classifier\n",
    "from phoenix.evals.llm import LLM\n",
    "\n",
    "llm = LLM(provider=\"openai\", model=\"gpt-5\")\n",
    "\n",
    "relevancy_evaluator = create_classifier(\n",
    "    name=\"ANSWER RELEVANCE\",\n",
    "    llm=llm,\n",
    "    prompt_template=ANSWER_RELEVANCE_PROMPT_TEMPLATE,\n",
    "    choices={\"correct\": 1.0, \"incorrect\": 0.0},\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "BUDGET_CONSISTENCY_PROMPT_TEMPLATE = \"\"\"\n",
    "You will be given a travel-planning query and an itinerary answer. Your task is to determine whether\n",
    "the answer provides a consistent and mathematically coherent budget. An answer is \"incorrect\" if:\n",
    "- The summed minimum costs of all listed budget categories exceed the stated minimum total estimate.\n",
    "- The summed maximum costs of all listed budget categories exceed the stated maximum total estimate.\n",
    "- The total estimate claims a range that cannot be derived from (or contradicted by) the itemized ranges.\n",
    "- The answer lists budget items but provides a total that is not numerically aligned with them.\n",
    "- The answer contradicts itself regarding pricing or cost ranges.\n",
    "\n",
    "    [BEGIN DATA]\n",
    "    ************\n",
    "    [Query]: {{input}}\n",
    "    ************\n",
    "    [Answer]: {{output}}\n",
    "    ************\n",
    "    [END DATA]\n",
    "\n",
    "Explain step-by-step how you evaluated the itemized costs and the final total, including whether\n",
    "the ranges mathematically match. Then provide a final LABEL:\n",
    "- Use \"correct\" if the budget totals are consistent with the itemized values.\n",
    "- Use \"incorrect\" if the totals contradict or cannot be derived from the itemized values.\n",
    "\n",
    "Your final output must be only one word: \"correct\" or \"incorrect\".\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = LLM(provider=\"openai\", model=\"gpt-5\")\n",
    "\n",
    "budget_evaluator = create_classifier(\n",
    "    name=\"BUDGET CONSISTENCY\",\n",
    "    llm=llm,\n",
    "    prompt_template=BUDGET_CONSISTENCY_PROMPT_TEMPLATE,\n",
    "    choices={\"correct\": 1.0, \"incorrect\": 0.0},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4UztPJWNUNJw"
   },
   "source": [
    "# Run Experiment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GV9wX7v9uMM4"
   },
   "source": [
    "The last step before running our experiment is to explicitly define the task. Although we know we are evaluating an agent, we must wrap simple function that takes an input and returns the agent’s output. This function becomes the task that the experiment will execute for each example in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def agent_task(input):\n",
    "    query = input[\"input\"]\n",
    "    response = trip_agent.run(query, stream=False)\n",
    "    return response.content"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sVCYu8iDuW_W"
   },
   "source": [
    "After defining the task, we construct our experiment by providing the dataset, the evaluators, and any relevant metadata. Once everything is configured, we can run the experiment.\n",
    "\n",
    "Depending on the size of the dataset, complexity of the task, and the number of evaluators, the run may take a few minutes to complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phoenix.client.experiments import run_experiment\n",
    "\n",
    "experiment = run_experiment(\n",
    "    dataset=dataset,\n",
    "    task=agent_task,\n",
    "    experiment_name=\"inital run\",\n",
    "    evaluators=[relevancy_evaluator, budget_evaluator],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Pi-E8dufAVC5"
   },
   "source": [
    "# Analyze Experiment Traces & Results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dybEa3Tg5Y9Q"
   },
   "source": [
    "Now that the experiment has run on each dataset example, you can explore the results in Phoenix. You’ll be able to:\n",
    "\n",
    "- View the full trace emitted by the agent and step through each action it took\n",
    "\n",
    "- Inspect evaluation outputs for every example, including scores, labels, and explanations\n",
    "\n",
    "- Examine the evaluation traces themselves (i.e., the LLM-as-a-Judge reasoning)\n",
    "\n",
    "- Review aggregate evaluation scores across the entire dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OCt9KTRj9EoO"
   },
   "source": [
    "<video controls width=\"100%\">\n",
    "  <source src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/end-to-end-python-tutorial-experiment.mp4\" type=\"video/mp4\">\n",
    "  Your browser does not support the video tag.\n",
    "</video>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kjkwUXL79zmU"
   },
   "source": [
    "### Human Annotations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "14cO2NMF9GXK"
   },
   "source": [
    "In addition to Evals, you can also add human annotations to your traces. These allow you to capture strong feedback, flag problematic outputs, highlight exemplary responses, and record insights that automated evaluators may miss. Human annotations get saved as part of the trace, helping you  guide future iterations of your application or agent."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "STGewsdG-75l"
   },
   "source": [
    "<video controls width=\"100%\">\n",
    "  <source src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/end-to-end-python-annotations.mp4\" type=\"video/mp4\">\n",
    "  Your browser does not support the video tag.\n",
    "</video>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mxCPK9UwAdCB"
   },
   "source": [
    "# Iterate and Re-Run Experiment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Rdtt5aqV_2u6"
   },
   "source": [
    "Now that we've spent time analyzing the experiment results, it's time to iterate.\n",
    "\n",
    "We will update the agent’s main prompt to be more intentional about budget calculations, since that area received a lower eval score. You can modify the prompt or make any other adjustments you believe will improve the agent’s performance, and then re-run the experiment to see how the outputs improve—or where they may regress.\n",
    "\n",
    "Iteration is a key part of refining agent behavior, and each experiment provides valuable feedback to guide the next step. Once you reach eval scores and outputs that meet your expectations, you can confidently push those changes to production."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- Main Agent with Updated Instructions ---\n",
    "trip_agent = Agent(\n",
    "    name=\"TripPlanner\",\n",
    "    role=\"AI Travel Assistant\",\n",
    "    model=OpenAIChat(id=\"gpt-4o-mini\"),\n",
    "    instructions=(\n",
    "        \"You are a friendly and knowledgeable travel planner. \"\n",
    "        \"Combine multiple tools to create a trip plan including essentials, budget, and local flavor. \"\n",
    "        \"Keep the tone natural, clear, and under 1000 words. \"\n",
    "        \"When providing budget details: Ensure the final total budget range is mathematically consistent with the sum of the itemized ranges.\"\n",
    "    ),\n",
    "    markdown=True,\n",
    "    tools=[essential_info, budget_basics, local_flavor],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = run_experiment(\n",
    "    dataset=dataset,\n",
    "    task=agent_task,\n",
    "    experiment_name=\"updated agent prompt to improve budget\",\n",
    "    evaluators=[relevancy_evaluator, budget_evaluator],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-GI6fXpCGJrq"
   },
   "source": [
    "In this case, the budget consistency score actually decreased, and the answer relevancy improved. By reviewing the traces and evaluation explanations, we can start to understand why this happened—perhaps the stricter prompt introduced new edge cases, or the agent over-corrected in unexpected ways.\n",
    "\n",
    "From here, we can continue the iterative process by forming a new hypothesis, applying changes based on what we’ve learned, and running another experiment. Each cycle helps refine the agent’s behavior and moves us closer to outputs that consistently meet the desired criteria.\n",
    "\n",
    "\n",
    "![Updated Results](https://storage.googleapis.com/arize-phoenix-assets/assets/images/end-to-end-python-tutorial-prompt-iteration.png)"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
